# load the previously processed data
import numpy as np
import pandas as pd
df_data = pd.read_csv('data1019.csv')
df_data.head()
| | CMPLNT_NUM | CMPLNT_DATE | CMPLNT_YEAR | CMPLNT_MONTH | OFNS_DESC | BORO_NM | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|
| 0 | 408450072 | 2012-01-02 | 2012 | 1 | OFF. AGNST PUB ORD SENSBLTY & | BROOKLYN | 40.597121 | -73.941249 |
| 1 | 114547452 | 2012-01-02 | 2012 | 1 | BURGLARY | QUEENS | 40.762102 | -73.820334 |
| 2 | 579666649 | 2012-01-02 | 2012 | 1 | ARSON | BRONX | 40.851316 | -73.902117 |
| 3 | 228510203 | 2012-01-02 | 2012 | 1 | PETIT LARCENY | QUEENS | 40.709501 | -73.786483 |
| 4 | 329522444 | 2012-01-02 | 2012 | 1 | BURGLARY | BROOKLYN | 40.668598 | -73.889725 |
crime_group_df = pd.DataFrame({'Number of Crime': df_data.groupby(['OFNS_DESC']).size()}).reset_index()
df_data['OFNS_DESC'].value_counts()
PETIT LARCENY 676860
HARRASSMENT 2 514354
ASSAULT 3 & RELATED OFFENSES 424031
CRIMINAL MISCHIEF & RELATED OF 380670
GRAND LARCENY 352753
...
FORTUNE TELLING 4
KIDNAPPING AND RELATED OFFENSES 3
OFFENSES AGAINST MARRIAGE UNCL 2
UNDER THE INFLUENCE OF DRUGS 2
LOITERING FOR DRUG PURPOSES 1
Name: OFNS_DESC, Length: 70, dtype: int64
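As a sanity check, `groupby(...).size()` (used to build `crime_group_df`) and `value_counts()` produce the same counts. A minimal sketch on a hypothetical miniature of `df_data`:

```python
import pandas as pd

# Hypothetical miniature of df_data, for illustration only
mini = pd.DataFrame({'OFNS_DESC': ['PETIT LARCENY'] * 3 + ['BURGLARY'] * 2 + ['ARSON']})

by_size = mini.groupby('OFNS_DESC').size().sort_values(ascending=False)
by_counts = mini['OFNS_DESC'].value_counts()

# Both orderings agree: the most frequent type comes first
print(by_counts.index[0])  # PETIT LARCENY
```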
Let's visualize it!
# Bar chart showing the number of crimes per offence type since 2012
crime_group_df.plot(kind='bar', x='OFNS_DESC', y='Number of Crime', figsize=(20,5))
<AxesSubplot:xlabel='OFNS_DESC'>
# Line chart showing the count and trend of each crime type since 2012
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(20, 20))
df_data.groupby(['CMPLNT_YEAR', 'OFNS_DESC']).count()['CMPLNT_NUM'].unstack().plot(ax=ax)
<AxesSubplot:xlabel='CMPLNT_YEAR'>
Let's find out which borough has the most crimes.
# most common borough
borough_top_2012 = pd.DataFrame({'count': df_data.groupby(['BORO_NM']).size()}).reset_index()
borough_top_2012.sort_values('count', ascending=False, inplace=True)
borough_top_2012
| | BORO_NM | count |
|---|---|---|
| 1 | BROOKLYN | 1144279 |
| 2 | MANHATTAN | 926791 |
| 0 | BRONX | 833774 |
| 3 | QUEENS | 764509 |
| 4 | STATEN ISLAND | 174446 |
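To put the borough table in perspective, each borough's share of the citywide total can be computed. A quick sketch using the counts from the table above:

```python
import pandas as pd

# Borough totals as reported in the table above
borough = pd.DataFrame({
    'BORO_NM': ['BROOKLYN', 'MANHATTAN', 'BRONX', 'QUEENS', 'STATEN ISLAND'],
    'count': [1144279, 926791, 833774, 764509, 174446],
})
borough['share_%'] = (borough['count'] / borough['count'].sum() * 100).round(1)
print(borough)
```

Brooklyn alone accounts for nearly 30% of all recorded complaints.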
---Q1 ANSWER: From the output we can see that PETIT LARCENY is the most common crime type. The top 5 crime types are PETIT LARCENY, HARRASSMENT 2, ASSAULT 3 & RELATED OFFENSES, CRIMINAL MISCHIEF & RELATED OF, and GRAND LARCENY.
Let's visualize the trend
# by year
fig, ax = plt.subplots(figsize=(20, 5))
df_data.groupby(['CMPLNT_YEAR']).count()['CMPLNT_NUM'].plot(ax=ax)
<AxesSubplot:xlabel='CMPLNT_YEAR'>
# by month
fig, ax = plt.subplots(figsize=(20, 5))
df_data.groupby(['CMPLNT_YEAR', 'CMPLNT_MONTH']).count()['CMPLNT_NUM'].plot(ax=ax)
<AxesSubplot:xlabel='CMPLNT_YEAR,CMPLNT_MONTH'>
---Q2 ANSWER: The number of crimes has decreased year by year. The efforts of the police and social-education systems appear to be working.
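The "decreasing year by year" claim can be quantified with a year-over-year percentage change on the annual totals. A sketch with hypothetical yearly totals (illustrative values, not the real NYPD figures):

```python
import pandas as pd

# Hypothetical yearly totals, for illustration only
yearly = pd.Series({2012: 500_000, 2013: 485_000, 2014: 470_000, 2015: 452_000})

# Year-over-year change, in percent; consistently negative means a downward trend
yoy = (yearly.pct_change() * 100).round(1)
print(yoy)
```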
First, let's count crimes by month since 2012.
# the total amount by month since 2012
crime_group_mon_df = pd.DataFrame({'Number of Crime': df_data.groupby(['CMPLNT_MONTH']).size()}).reset_index()
crime_group_mon_df.sort_values('Number of Crime', ascending=False, inplace=True)
crime_group_mon_df.reset_index(drop=True)
| | CMPLNT_MONTH | Number of Crime |
|---|---|---|
| 0 | 8 | 347684 |
| 1 | 7 | 347652 |
| 2 | 5 | 339394 |
| 3 | 10 | 335357 |
| 4 | 6 | 332877 |
| 5 | 9 | 331210 |
| 6 | 4 | 312878 |
| 7 | 3 | 311535 |
| 8 | 1 | 305770 |
| 9 | 12 | 304898 |
| 10 | 11 | 302668 |
| 11 | 2 | 274565 |
crime_group_mon_df.plot.scatter(x='CMPLNT_MONTH',
y='Number of Crime',
c='DarkBlue')
<AxesSubplot:xlabel='CMPLNT_MONTH', ylabel='Number of Crime'>
# the total amount by year and month since 2012
crime_group_mon_df2 = pd.DataFrame({'Number of Crime': df_data.groupby(['CMPLNT_YEAR','CMPLNT_MONTH']).size()}).reset_index()
crime_group_mon_df2.reset_index(drop=True)
crime_group_mon_df2.head()
| | CMPLNT_YEAR | CMPLNT_MONTH | Number of Crime |
|---|---|---|---|
| 0 | 2012 | 1 | 39802 |
| 1 | 2012 | 2 | 39218 |
| 2 | 2012 | 3 | 43220 |
| 3 | 2012 | 4 | 41631 |
| 4 | 2012 | 5 | 44882 |
crime_group_mon_df2.plot.scatter(x='CMPLNT_MONTH',
y='Number of Crime',
c='DarkBlue')
<AxesSubplot:xlabel='CMPLNT_MONTH', ylabel='Number of Crime'>
From the data and figures we can conclude that July and August have the most crimes; criminals are more active in the summer.
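The summer effect can be measured directly by comparing the mean monthly count for June–August against the rest of the year, using the monthly totals from the table above:

```python
import pandas as pd

# Monthly totals as reported in the table above (all years combined)
monthly = pd.Series({8: 347684, 7: 347652, 5: 339394, 10: 335357, 6: 332877,
                     9: 331210, 4: 312878, 3: 311535, 1: 305770, 12: 304898,
                     11: 302668, 2: 274565})

summer = monthly.loc[[6, 7, 8]].mean()   # June, July, August
rest = monthly.drop([6, 7, 8]).mean()    # the other nine months
print(round(summer - rest))  # extra complaints per summer month
```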
Then, let's focus on the most recent year 2019 for further analysis.
# get the data of 2019
df_data_2019 = df_data.loc[df_data['CMPLNT_YEAR'] == 2019]
df_data_2019.reset_index(drop=True)
df_data_2019.head()
| | CMPLNT_NUM | CMPLNT_DATE | CMPLNT_YEAR | CMPLNT_MONTH | OFNS_DESC | BORO_NM | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|
| 3395512 | 442809231 | 2019-01-01 | 2019 | 1 | SEX CRIMES | BROOKLYN | 40.674583 | -73.930222 |
| 3395513 | 651750698 | 2019-01-01 | 2019 | 1 | GRAND LARCENY | MANHATTAN | 40.748831 | -73.988478 |
| 3395514 | 394629573 | 2019-01-01 | 2019 | 1 | CRIMINAL MISCHIEF & RELATED OF | BRONX | 40.827820 | -73.925931 |
| 3395515 | 967895740 | 2019-01-01 | 2019 | 1 | HARRASSMENT 2 | MANHATTAN | 40.804384 | -73.937422 |
| 3395516 | 293332550 | 2019-01-01 | 2019 | 1 | HARRASSMENT 2 | BRONX | 40.901630 | -73.896747 |
# most common offence type
type_top5 = pd.DataFrame({'count': df_data_2019.groupby(['OFNS_DESC']).size()})
type_top5.sort_values('count', ascending=False, inplace=True)
type_top5.head()
| OFNS_DESC | count |
|---|---|
| PETIT LARCENY | 87644 |
| HARRASSMENT 2 | 70972 |
| ASSAULT 3 & RELATED OFFENSES | 52581 |
| CRIMINAL MISCHIEF & RELATED OF | 46314 |
| GRAND LARCENY | 41041 |
# most common borough
borough_top = pd.DataFrame({'count': df_data_2019.groupby(['BORO_NM']).size()}).reset_index()
borough_top.sort_values('count', ascending=False, inplace=True)
borough_top
| | BORO_NM | count |
|---|---|---|
| 1 | BROOKLYN | 129370 |
| 2 | MANHATTAN | 113649 |
| 0 | BRONX | 99002 |
| 3 | QUEENS | 90146 |
| 4 | STATEN ISLAND | 18510 |
# the total amount by month of 2019
crime_group_mon_df_2019 = pd.DataFrame({'Number of Crime': df_data_2019.groupby(['CMPLNT_MONTH']).size()}).reset_index()
crime_group_mon_df_2019.sort_values('Number of Crime', ascending=False, inplace=True)
crime_group_mon_df_2019.reset_index(drop=True)
| | CMPLNT_MONTH | Number of Crime |
|---|---|---|
| 0 | 7 | 42045 |
| 1 | 8 | 40650 |
| 2 | 5 | 40022 |
| 3 | 9 | 39878 |
| 4 | 6 | 39708 |
| 5 | 10 | 39165 |
| 6 | 3 | 36284 |
| 7 | 4 | 36247 |
| 8 | 11 | 35981 |
| 9 | 1 | 35843 |
| 10 | 12 | 33383 |
| 11 | 2 | 31770 |
crime_group_mon_df_2019.plot.scatter(x='CMPLNT_MONTH',
y='Number of Crime',
c='DarkBlue')
<AxesSubplot:xlabel='CMPLNT_MONTH', ylabel='Number of Crime'>
The 2019 data shows the same trend as the historical data.
---Q3 ANSWER: Crime follows a curve over the course of the year. The peak is in July and August, which is the busiest time for the police.
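The 2019 peak month can be extracted programmatically with `idxmax`, using the monthly counts from the 2019 table above:

```python
import pandas as pd

# 2019 monthly totals as reported in the table above
mon_2019 = pd.DataFrame({
    'CMPLNT_MONTH': [7, 8, 5, 9, 6, 10, 3, 4, 11, 1, 12, 2],
    'Number of Crime': [42045, 40650, 40022, 39878, 39708, 39165,
                        36284, 36247, 35981, 35843, 33383, 31770],
})

# Row with the largest count -> its month
peak = mon_2019.loc[mon_2019['Number of Crime'].idxmax(), 'CMPLNT_MONTH']
print(peak)  # 7 (July)
```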
Use the geopy library to get the latitude and longitude of New York City.
!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
Collecting package metadata (current_repodata.json): done Solving environment: done # All requested packages already installed.
address = 'New York City, NY'
geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of New York City are {}, {}.'.format(latitude, longitude))
The geographical coordinates of New York City are 40.7127281, -74.0060152.
Import the folium library to visualize the data on maps.
! pip install folium==0.5.0
import folium # plotting library
borough_map = folium.Map(location=[40.7127281, -74.0060152], zoom_start=10.5, tiles='Stamen Toner')
Due to the large amount of data, let's focus on August 2019 for the visualization.
# get the data of Aug, 2019
df_data_201908 = df_data_2019.loc[df_data_2019['CMPLNT_MONTH'] == 8]
df_data_201908.reset_index(drop=True)
df_data_201908.head()
| | CMPLNT_NUM | CMPLNT_DATE | CMPLNT_YEAR | CMPLNT_MONTH | OFNS_DESC | BORO_NM | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|
| 3657431 | 645757844 | 2019-08-01 | 2019 | 8 | DANGEROUS WEAPONS | BRONX | 40.840188 | -73.926463 |
| 3657432 | 815573324 | 2019-08-01 | 2019 | 8 | HARRASSMENT 2 | BROOKLYN | 40.664307 | -73.869972 |
| 3657433 | 620131398 | 2019-08-01 | 2019 | 8 | CRIMINAL MISCHIEF & RELATED OF | MANHATTAN | 40.710229 | -74.007746 |
| 3657434 | 831375149 | 2019-08-01 | 2019 | 8 | OTHER OFFENSES RELATED TO THEF | BROOKLYN | 40.676981 | -73.937310 |
| 3657435 | 203426514 | 2019-08-01 | 2019 | 8 | HARRASSMENT 2 | BRONX | 40.844884 | -73.891583 |
We cannot visualize all the cases due to the large amount, so we can cluster them for better visualization.
from folium.plugins import MarkerCluster
marker_cluster = MarkerCluster().add_to(borough_map)
for lat, lon in zip(df_data_201908.Latitude, df_data_201908.Longitude):
folium.Marker([lat, lon]).add_to(marker_cluster)
borough_map
Define Foursquare Credentials and Version
CLIENT_ID = 'YQ0X3FK5BEZZJ5VJXXTN0WMWV1E533RJMYHEW5NTAK5A5NUL' # your Foursquare ID
CLIENT_SECRET = 'IQDXGDSDBZVY1RGX2J3AZO3C2VTW2WLK05GREQZWW12QZ0TX' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value
print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)
Your credentials: CLIENT_ID: YQ0X3FK5BEZZJ5VJXXTN0WMWV1E533RJMYHEW5NTAK5A5NUL CLIENT_SECRET:IQDXGDSDBZVY1RGX2J3AZO3C2VTW2WLK05GREQZWW12QZ0TX
Search the whole city and get the top 100 venues.
url = 'https://api.foursquare.com/v2/venues/search?near=New%20York,%20NY&client_id={}&client_secret={}&v={}&limit={}&categoryId=4d4b7105d754a06378d81259'.format(
CLIENT_ID,
CLIENT_SECRET,
VERSION,
LIMIT)
url
'https://api.foursquare.com/v2/venues/search?near=New%20York,%20NY&client_id=YQ0X3FK5BEZZJ5VJXXTN0WMWV1E533RJMYHEW5NTAK5A5NUL&client_secret=IQDXGDSDBZVY1RGX2J3AZO3C2VTW2WLK05GREQZWW12QZ0TX&v=20180605&limit=100&categoryId=4d4b7105d754a06378d81259'
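Hand-formatting the query string is error-prone; the standard library's `urlencode` builds an equivalent URL and handles escaping automatically. A sketch with placeholder credentials (substitute your own `CLIENT_ID` / `CLIENT_SECRET`):

```python
from urllib.parse import urlencode

# Placeholder credentials, for illustration only
params = {
    'near': 'New York, NY',
    'client_id': 'CLIENT_ID',
    'client_secret': 'CLIENT_SECRET',
    'v': '20180605',
    'limit': 100,
    'categoryId': '4d4b7105d754a06378d81259',
}
url = 'https://api.foursquare.com/v2/venues/search?' + urlencode(params)
print(url)
```

Note that `urlencode` encodes spaces as `+` rather than `%20`; both forms are accepted in query strings.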
import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON into a pandas dataframe
results = requests.get(url).json()
results
Analyze the results and extract the location information.
df_results = pd.DataFrame(results['response']['venues'])
df_loc=df_results['location']
df_ll_array=np.asarray(df_loc)
df_ll_array
nearby_venues = pd.json_normalize(df_ll_array) # flatten JSON (pandas.io.json.json_normalize is deprecated)
nearby_venues.head()
| | address | crossStreet | lat | lng | labeledLatLngs | postalCode | cc | city | state | country | formattedAddress | neighborhood |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 675 6th Ave | at W 21st St | 40.741739 | -73.993653 | [{'label': 'display', 'lat': 40.74173913980557... | 10010 | US | New York | NY | United States | [675 6th Ave (at W 21st St), New York, NY 1001... | NaN |
| 1 | 200 5th Ave | at W 23rd St | 40.741987 | -73.989913 | [{'label': 'display', 'lat': 40.74198696405861... | 10010 | US | New York | NY | United States | [200 5th Ave (at W 23rd St), New York, NY 1001... | NaN |
| 2 | NaN | NaN | 40.771785 | -73.833839 | [{'label': 'display', 'lat': 40.77178493034232... | 11356 | US | College Point | NY | United States | [College Point, NY 11356, United States] | NaN |
| 3 | 767 5th Ave | btwn E 58th & E 59th St | 40.763733 | -73.972744 | [{'label': 'display', 'lat': 40.7637325, 'lng'... | 10153 | US | New York | NY | United States | [767 5th Ave (btwn E 58th & E 59th St), New Yo... | NaN |
| 4 | 200 Broadway | btwn John & Fulton St | 40.710520 | -74.008965 | [{'label': 'display', 'lat': 40.71051988539366... | 10038 | US | New York | NY | United States | [200 Broadway (btwn John & Fulton St), New Yor... | NaN |
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))
50 venues were returned by Foursquare.
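`pd.json_normalize` flattens a list of records into columns, filling missing keys with NaN. A minimal sketch with made-up location records (illustrative values, not real Foursquare data):

```python
import pandas as pd

# Made-up stand-ins for the Foursquare 'location' records
locations = [
    {'address': '675 6th Ave', 'lat': 40.741739, 'lng': -73.993653},
    {'lat': 40.771785, 'lng': -73.833839},  # some venues have no street address
]
flat = pd.json_normalize(locations)
print(flat.columns.tolist())  # ['address', 'lat', 'lng']
```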
Display the venues on the map.
for lat, lng in zip(nearby_venues.lat, nearby_venues.lng):
folium.Marker([lat, lng]).add_to(borough_map)
borough_map